Now that's what I call a dead parrot.


These tutorials are a simplified 
introduction, and are not sufficient on 
their own to achieve system safety. 
You are responsible for the safety of 
your system. 
- John Cleese (Monty Python)





Is Your System Dependable?

■ Anti-Patterns for Dependability:
  • No concrete dependability goal
  • Confusing reliability vs. availability
  • Mission time is life of product





■ Can you trust your system?
  • Availability: fraction of up-time
  • Reliability: probability system will complete a mission
  • Other properties, such as:
    - Maintainability
    - Integrity
    - Confidentiality
    - Safety


https://goo.gl/JwwxVH | www.cgpgrey.com
Availability Is "Up Time"

■ $\text{Availability} = \dfrac{\text{UpTime}}{\text{UpTime} + \text{DownTime}}$

■ Limits to availability
  • Frequency of system failures
    - Redundancy can improve availability
  • Detection & repair time
    - Detect, diagnose, repair failed component, restart the system
    - Time to reconfigure to redundant standby
  • As a practical matter, 99.999% is considered "high availability"
    - 99.999% "Five nines" ≈ 5 minutes/year down time
    - 99.9999% "Six nines" ≈ 31.5 seconds/year down time

■ 99.9999% availability target ≈ 2.6 seconds/month downtime

[Image: counter sign reading "Hours Since Last System Crash"]

© 2020 Philip Koopman
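The "nines" figures above follow directly from the up-time fraction. The sketch below shows one way to check them; the function names are illustrative, and a 365-day year is assumed.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 s; assumes a 365-day year


def availability(up_time: float, down_time: float) -> float:
    """Fraction of total time the system was up (any consistent time unit)."""
    return up_time / (up_time + down_time)


def downtime_seconds_per_year(target: float) -> float:
    """Downtime budget per year implied by an availability target."""
    return (1.0 - target) * SECONDS_PER_YEAR


if __name__ == "__main__":
    for label, target in [("five nines", 0.99999), ("six nines", 0.999999)]:
        per_year = downtime_seconds_per_year(target)
        print(f"{label}: {per_year:.1f} s/year, {per_year / 12:.2f} s/month")
```

Five nines works out to about 315 seconds (a little over 5 minutes) per year, and six nines to about 31.5 seconds per year, or roughly 2.6 seconds per month.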


MS blames lowly techie for Web blackout

Takes 22 hours to fix router config error

By John Leyden, 25 Jan 2001 at 11:48


Microsoft has blamed a lowly technician for a cock-up which almost 
completely blocked access to its Web sites for most users yesterday. 


From the early hours of yesterday morning until late evening 
www.microsoft.com, msn.com, expedia.co.uk and msnbc.com were all 
unavailable. The software giant's Hotmail service was also inaccessible 
for many. 


The problem, whose final resolution came some six hours after Microsoft 
promised a fix would be in place yesterday, was due to changes in 
Microsoft's domain name server network caused requests to access its 
Web sites to fail. A fix was eventually put in place when Microsoft 
removed the changes made to the configuration that were behind the 
problem. 


In a statement, Microsoft admitted: "At 6:30 p.m. Tuesday (PST), a 
Microsoft technician made a configuration change to the routers on the 
edge of Microsoft's Domain Name Server network. The DNS servers are 
used to connect domain names with numeric IP addresses (eg. 
207.46.230.219) of the various servers and networks that make up 
Microsoft's Web presence. 


"The mistaken configuration change limited communication between 
DNS servers on the Internet and Microsoft's DNS servers. This limited 
communication caused many of Microsoft's sites to be unreachable 
(although they were actually still operational) to a large number of 
customers."


https://www.theregister.co.uk/2001/01/25/ms_blames_lowly_techie/

[Image: "The mythical five nines. 99.999%. As close to perfect as you can get without breaking some law of nature."]





Measuring Reliability

■ Reliability is based on the concept of a "mission"
  • Reliability R(t): probability the system is still working at time t after the start of the mission
  • A mission is t continuous operating hours between diagnostics
  • Constant failure rate λ (failures/hr)

[Figure: failure probability 1 - R(t) versus time for a constant failure rate λ]
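With a constant failure rate λ, the example calculations in these slides use the exponential model R(t) = e^(-λt), so MTBF = 1/λ. A minimal sketch of that model follows; the function names and the sample rate of 1e-5 failures/hr are illustrative assumptions, not values from the slides.

```python
import math


def reliability(failure_rate_per_hour: float, mission_hours: float) -> float:
    """R(t) = exp(-lambda * t): probability of no failure during the mission."""
    return math.exp(-failure_rate_per_hour * mission_hours)


def failure_probability(failure_rate_per_hour: float, mission_hours: float) -> float:
    """1 - R(t), computed with expm1 to stay accurate for very small lambda * t."""
    return -math.expm1(-failure_rate_per_hour * mission_hours)


if __name__ == "__main__":
    rate = 1e-5  # hypothetical failure rate, failures per hour
    for hours in (1, 10, 1_000, 100_000):
        print(f"t = {hours:>7} h: R(t) = {reliability(rate, hours):.6f}")
```

Reliability always starts at 1 and decays as the mission gets longer, which is why the choice of mission time matters as much as the failure rate.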
Redundancy Improves Reliability

■ Serial reliability
  • Even good components aren't enough
  • E.g.: 0.9 * 0.9 * 0.9 = 0.73

  $R(t)_{\mathrm{SERIAL}} = R(t)_1 \, R(t)_2 \, R(t)_3 = \prod_i R(t)_i$

■ Parallel reliability
  • Redundancy improves reliability
  • E.g., three @ 0.9 = 0.999

  $R(t)_{\mathrm{PARALLEL}} = 1 - \prod_i \bigl(1 - R(t)_i\bigr)$
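The two formulas can be checked against the slide's 0.9 examples with a short sketch. The function names are illustrative, and both formulas assume that component failures are independent.

```python
import math
from typing import Iterable


def serial_reliability(parts: Iterable[float]) -> float:
    """All components must work: multiply the individual reliabilities."""
    return math.prod(parts)


def parallel_reliability(parts: Iterable[float]) -> float:
    """Any one component suffices: 1 minus the chance that all of them fail."""
    return 1.0 - math.prod(1.0 - r for r in parts)


if __name__ == "__main__":
    three_parts = [0.9, 0.9, 0.9]
    print(f"serial:   {serial_reliability(three_parts):.3f}")
    print(f"parallel: {parallel_reliability(three_parts):.3f}")
```

The same three components give about 0.73 in series and 0.999 in parallel, provided the failures really are independent.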
Example Calculations

■ Reliability at MTBF, R(1/λ), is 36.8%, not 50%. Why?

■ What is the reliability of this system for a 3-hour mission?

  [Figure: components 1, 2, and 3 in parallel, followed in series by component 4]

  - $\lambda_1 = 7$ per million hours
  - $\lambda_2 = 200$ per million hours
  - $\lambda_3 = 15000$ per million hours
  - $\lambda_4 = 2$ per million hours
  - $R(3)_1 = e^{-3 \cdot 7 \cdot 10^{-6}} = 0.999979$
  - $R(3)_2 = e^{-3 \cdot 200 \cdot 10^{-6}} = 0.999400$
  - $R(3)_3 = e^{-3 \cdot 15000 \cdot 10^{-6}} = 0.955997$
  - $R(3)_4 = e^{-3 \cdot 2 \cdot 10^{-6}} = 0.999994$
  - $R(3)_{\mathrm{PARALLEL}} = 1 - [(1 - R(3)_1)(1 - R(3)_2)(1 - R(3)_3)] = 0.999\,999\,999\,45$
  - $R(3)_{\mathrm{TOTAL}} = R(3)_{\mathrm{PARALLEL}} \cdot R(3)_4 = 0.999\,999\,999\,45 \cdot 0.999994 = 0.999994$
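The worked example can be reproduced numerically. The sketch below uses the failure rates and the 3-hour mission from the slide; the variable and function names are illustrative. It also prints R(1/λ): under a constant failure rate this is e^(-1) ≈ 0.368 for any λ, because the exponential decay has already used up more than half of the survival probability by the time t reaches the mean.

```python
import math

MISSION_HOURS = 3.0
# Failure rates from the slide, in failures per million hours.
FAILURES_PER_MILLION_HOURS = {1: 7.0, 2: 200.0, 3: 15000.0, 4: 2.0}


def reliability(rate_per_million_hours: float, hours: float) -> float:
    """R(t) = exp(-lambda * t), with lambda given per million hours."""
    return math.exp(-rate_per_million_hours * 1e-6 * hours)


r = {k: reliability(v, MISSION_HOURS) for k, v in FAILURES_PER_MILLION_HOURS.items()}

# Components 1, 2, and 3 are redundant: the group fails only if all three fail.
parallel_failure = (1 - r[1]) * (1 - r[2]) * (1 - r[3])
r_parallel = 1 - parallel_failure

# Component 4 is in series with the redundant group.
r_total = r_parallel * r[4]

if __name__ == "__main__":
    print(f"R(1/lambda) = {math.exp(-1):.3f}")
    for k in sorted(r):
        print(f"R(3)_{k} = {r[k]:.6f}")
    print(f"R(3)_PARALLEL = {r_parallel:.11f}")
    print(f"R(3)_TOTAL    = {r_total:.6f}")
```

The total is dominated by component 4, the only non-redundant part, even though it has the lowest failure rate of the four.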
[Images: https://bit.ly/2pzdJ7p | https://bit.ly/2O0XaH7I]
Other Aspects of Dependability

■ Availability: up-time fraction
■ Reliability: no failures
■ Safety: no mishaps, no loss events
■ Confidentiality: no disclosures
■ Integrity: no corruption of state
■ Maintainability: system can be fixed
  • E.g., "80% of failures can be fixed in 1 hour"
■ Fault progression:
  • A fault is something that goes wrong (e.g., bit flip)
  • An error is an activated fault (e.g., flipped bit is read and used in a calculation)
  • A failure is when the system does not provide required service (e.g., incorrect output)

[Figure: dependability and security tree. Attributes: Availability, Reliability, Safety, Confidentiality, Integrity, Maintainability. Threats: Faults, Errors, Failures. Means: Fault Prevention, Fault Tolerance, Fault Removal, Fault Forecasting. Source: https://goo.gl/SyV4uZ]

A. Avizienis, J.-C. Laprie, B. Randell, C. Landwehr, "Basic concepts and taxonomy of dependable and secure computing," IEEE Trans. Dependable and Secure Computing, Jan-Mar 2004, pp. 11-33.
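The fault, error, failure progression can be illustrated with a toy model. Everything in this sketch is hypothetical (the `memory` dictionary, the `setpoint` value, and the doubling calculation); it only shows that a fault stays dormant until the corrupted value is actually used.

```python
def flip_bit(value: int, bit: int) -> int:
    """Model a fault: invert one bit of a stored integer."""
    return value ^ (1 << bit)


# Hypothetical stored state; only "setpoint" is ever read by the computation.
memory = {"setpoint": 100, "unused": 7}

# Faults: a bit flips in each cell.
memory["unused"] = flip_bit(memory["unused"], 3)      # dormant fault, never read
memory["setpoint"] = flip_bit(memory["setpoint"], 4)  # 100 becomes 116


def compute_output(mem: dict) -> int:
    """Reading the corrupted cell activates the fault, producing an error."""
    return mem["setpoint"] * 2


expected = 200
actual = compute_output(memory)

# Failure: the delivered service differs from the required service.
failure = actual != expected

if __name__ == "__main__":
    print(f"expected {expected}, got {actual}, failure = {failure}")
```

The fault in the unused cell never becomes an error, which is why fault counts alone say little about how often the system fails.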


Best Practices For Dependability

■ Specify a dependability target
  • "Never fails" is unrealistic
  • Do you care about reliability or availability?

■ Minimize impact of any faults
  • Fault → Error → System Failure
  • Parallel redundancy usually helps
  • Fast detection and reconfiguration

■ Pitfalls:
  • Long missions without redundancy diagnosis/repair
  • Non-redundant components are weak spot = single points of failure
    - Software failures are generally neither random nor independent
  • Security matters too: attacks; outages for patches
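The pitfall of long missions without diagnosis or repair can be made concrete with a rough sketch. It assumes two independent redundant units with the same constant failure rate and a check at the end of each mission that restores both units to working order. The rate of 1e-4 failures/hr and the two mission lengths are illustrative assumptions, not values from the slides.

```python
import math


def redundant_pair_failure_probability(rate_per_hour: float,
                                       hours_between_checks: float) -> float:
    """Chance that both units of a redundant pair fail within one mission.

    Assumes independent failures, a constant failure rate, and no repair
    until the mission ends.
    """
    unit_failure = -math.expm1(-rate_per_hour * hours_between_checks)
    return unit_failure * unit_failure


if __name__ == "__main__":
    rate = 1e-4  # hypothetical failure rate, failures per hour
    for label, hours in [("10-hour mission", 10), ("50,000-hour product life", 50_000)]:
        p = redundant_pair_failure_probability(rate, hours)
        print(f"{label}: P(both units fail) = {p:.3g}")
```

Under these assumptions the pair is very unlikely to fail within a 10-hour mission, but almost certain to fail if the mission is the whole product life, because the first failed unit is never detected and replaced.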
[Comic: a tech support call in which the caller learns the code word "SHIBBOLEET", which transfers any support call to someone who knows a minimum of two programming languages. https://xkcd.com/806/]